TFRecords (Part 1): Converting a Dataset into TFRecord Files
When taking ML/Deep Learning courses, it is common to have access to pre-loaded datasets like MNIST and FashionMNIST, as they ship with Deep Learning frameworks like TensorFlow. But in real-world applications, your dataset will most likely not be readily available like this. It might also be too large to fit into memory at once.
TFRecords offer a practical solution to the problem of training models on large datasets. A TensorFlow record (tfrecord) is a binary file format designed for efficient storage and loading of large datasets. TFRecords make training an ML model easier and work well with different Deep Learning and Machine Learning libraries.
HOW?
Here, I’d be showing how to use tfrecords: first, how to convert your dataset into tfrecords, then how to read a tfrecord file, and finally, how to train a Machine Learning model using tfrecords.
In this part, I would be focusing solely on converting your dataset into tfrecords.
For this, I’d be using the RSNA Screening Mammography Breast Cancer Detection dataset. It contains over 50,000 medical images and is over 300GB in size, so it’s perfect for this. It also has a train.csv file that contains other useful information about the images and the target.
The first step is usually to load your dataset into a Jupyter notebook session or onto your local computer. I’m using a Kaggle Notebook, so this is fairly straightforward.
Here’s a snapshot of the dataset and the first two columns of the train.csv file.
The focus here would be to have the image and its corresponding target value as a record/example in the tfrecord file.
Next, we specify the number of records we want in each tfrecord file. The right number usually depends on your dataset; I’d be using 1000 records here.
NUM_RECORDS = 1000
Next, based on the number of records, we figure out how many tfrecord files we would need to store the entire dataset.
num_tfrecords = max(train.shape) // NUM_RECORDS  # max(train.shape) is the number of rows here
if max(train.shape) % NUM_RECORDS:
    num_tfrecords += 1  # one extra file for the remaining samples
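As a quick illustration of the arithmetic (using a made-up row count, since the exact size depends on your copy of the dataset):
num_rows = 54_706                      # hypothetical dataset size
num_files = num_rows // NUM_RECORDS    # 54 full tfrecord files
if num_rows % NUM_RECORDS:             # 706 leftover samples
    num_files += 1                     # -> 55 files in total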
Then we define some helper functions.
To use tfrecords, we have to define helper functions that turn the data into a tf.train.Feature object, which is later used as a feature in the tf.train.Example protocol buffer. We need different helper functions to handle the different data types in our dataset. Since there are two datatypes in the dataset, np.ndarray and int64, two helper functions are defined.
Here are the helper functions:
import tensorflow as tf

def image_feature(value):
    """Returns a bytes_list feature containing the PNG-encoded image."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[tf.io.encode_png(value).numpy()])
    )

def int64_feature(value):
    """Returns an int64_list feature from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
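As a quick check (with a toy label value of 1, purely for illustration), calling one of the helpers prints the underlying protocol buffer:
print(int64_feature(1))
# int64_list {
#   value: 1
# }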
tf.io.encode_png converts and compresses tensors to PNG. tf.io also has an encode_jpeg function for converting tensors to JPEG.
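Here’s a rough sketch of what tf.io.encode_png does on its own, using a random uint8 array in place of a real mammogram:
import numpy as np

dummy = np.random.randint(0, 255, size=(224, 224, 1), dtype=np.uint8)  # fake grayscale image
png_bytes = tf.io.encode_png(dummy)   # scalar string tensor holding the PNG bytes
print(type(png_bytes.numpy()))        # <class 'bytes'>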
Here, I show an additional step for processing the images from DICOM files. DICOM (`.dcm`) files are used for storing medical images and also contain information about the patients. To read .dcm files in Python, use the pydicom library. Here’s an example:
import pydicom as dicom

dcm_path = '.../10006/1459541791.dcm'
ds = dicom.dcmread(dcm_path)  # read the dcm file from the directory
image = ds.pixel_array        # get the image as a NumPy array
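A quick look at what pydicom gives back (the exact shape varies from scan to scan):
print(type(image))   # <class 'numpy.ndarray'>
print(image.shape)   # the scan's native resolution, which varies per image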
With this information, we create a preprocessing function that reads the dcm file from a path and converts it to a (224, 224, 1) tensor.
import cv2

def process_image(dcm_path):
    """Read the image from a dcm path and resize it to (224, 224, 1)."""
    ds = dicom.dcmread(dcm_path)  # use the dcm_path argument, not a global variable
    image = cv2.resize(ds.pixel_array, (224, 224)).reshape(224, 224, 1)
    return image
Next, we create a function that builds each example to be written into the tfrecord file. The function takes the preprocessed image and target value as input and returns an instance of tf.train.Example. Each feature is processed using the helper functions defined above.
def create_example(image, target):
    """Build a tf.train.Example from an image and its target."""
    feature = {
        "image": image_feature(image),
        "target": int64_feature(target),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
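A quick sanity check on the output, using an all-zeros placeholder image instead of a real scan:
dummy_image = np.zeros((224, 224, 1), dtype=np.uint8)  # placeholder image
example = create_example(dummy_image, 0)               # toy target of 0
print(sorted(example.features.feature.keys()))         # ['image', 'target']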
Finally, using tf.io.TFRecordWriter, we write the dataset into tfrecord files.
for tfrec_num in range(num_tfrecords):
    # image ids that go into this tfrecord file
    samples = train['image_id'][(tfrec_num * NUM_RECORDS) : ((tfrec_num + 1) * NUM_RECORDS)]
    # tfrecords_dir is the output directory for the .tfrec files
    tf_dir = tfrecords_dir + f'/tfrecord_{tfrec_num * NUM_RECORDS}-{(tfrec_num + 1) * NUM_RECORDS}.tfrec'
    with tf.io.TFRecordWriter(tf_dir) as writer:
        for sample in samples:
            image_path = train[train['image_id'] == sample]['image_path'].iloc[0]
            image = process_image(image_path)
            target = train[train['image_id'] == sample]['cancer'].iloc[0]
            record = create_example(image, target)
            writer.write(record.SerializeToString())
The code above writes the dataset into tfrecord files. The first two lines get the samples/data to be added to the tfrecord file and name the file, e.g. tfrecord_1000-2000.tfrec (for samples 1000 to 2000).
Data is written to the tfrecord file one example at a time. So, using a for loop, we loop over the samples and, with tf.io.TFRecordWriter, write each sample (example) to the tfrecord file. On the last line, notice the use of the write and SerializeToString methods: SerializeToString converts the example to a binary string, and write writes it to the file.
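To see the round trip explicitly, you can serialize one example and parse it back (a sketch using the dummy example built earlier):
serialized = example.SerializeToString()             # the binary string that gets written
restored = tf.train.Example.FromString(serialized)   # parse the bytes back into an Example
print(restored.features.feature['target'].int64_list.value)  # [0]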
In the second part, we focus on reading data from a tfrecord file and training a Deep Neural Network with tfrecord files.
You can read it here: TFRecords (Part 2): Reading and training models with TFRecords.
by sodipe🌚 on February 4, 2023